“Everything is related to everything else, but near things are more related than distant things.”

First law of geography (Tobler 1970)1.

# Libraries
library(tidyverse)
library(spdep)
library(rgdal)
library(sf)
library(sp)
library(rgeos)
library(terra)
library(DT)

# Function that creates a palette from 
# yellow to red
colours_palette = colorRampPalette(c("#fedb71","#FCD471","#facd71",
                                     "#f6bf70","#eea26f","#EA946F",
                                     "#e6866e","#e2786e","#e0716e","#dd696d"))

Reading data

The necessary data for this type of analysis are:

# Read municipality data
tn <- readOGR("../data/aggregated_data_per_municipality", encoding="UTF-8")
## OGR data source with driver: ESRI Shapefile 
## Source: "G:\Il mio Drive\2nd Year\Geospatial\TrentinoSchools\data\aggregated_data_per_municipality", layer: "aggregated_data_per_municipality"
## with 166 features
## It has 12 fields
## Integer64 fields read as strings:  Popolazion Pop under
# Setting the CRS
tn <- spTransform(tn, CRS("+init=epsg:4326"))

# Function to replace NAs with 0
na.zero <- function (x) {
    x[is.na(x)] <- 0
    return(x)
}
names(tn@data)[3] = "Scuole"
names(tn@data)[7] = "Media_stud_classe"
names(tn@data)[8] = "Media_stud_scuola"
names(tn@data)[10] = "Pop_under_20"
names(tn@data)[11] = "Pop_under_20_Pop_tot"
names(tn@data)[12] = "Stud_Pop_under_20"
# Replace NAs about municipalities' number of schools with 0s
tn@data$Scuole = na.zero(tn@data$Scuole)

# Plot the municipalities in Trentino
par(mar=c(0,0,0,0))
plot(tn, axes = F, border="grey")

It is worth noticing from the map of school points that many of them overlap, especially in the Adige’s Valley and most populated areas, such as Trento, Rovereto and Riva del Garda.

# Import shapefile as SpatialPointsDataFrame
schools <- readOGR("../data/Trentino/schools/schools.shp",verbose = FALSE)

# Setting the CRS
schools <- spTransform(schools, CRS("+init=epsg:4326"))

# Plot schools over Trentino's map
par(mar=c(0,0,0,0))
plot(tn, border = "grey", axes = F)
points(schools@coords, col = alpha("#f27059",0.3), cex = 1, pch = 16)

Descriptive spatial statistics (global analysis)

Centroids

Before starting with the global analysis of Trentino municipalities, it is necessary to select some representative points for each municipality as a unique reference to the spatial coordinate.

par(mar=c(0,0,0,0))
plot(tn, border="grey")
points(coordinates(tn), 
       col="red", 
       bg = "#EF798A", 
       pch = 21,  
       lwd = 1.5)

Commonly, the centroid is a good choice, but still some problems can emerge and the centroid may not fall inside the boundaries of the territory. In this particular case, this may happen with multipolygons shape, i.e. those municipalities represented by multiple numbers of polygons that do not share a boundary. Some examples are the municipalities of Tione di Trento, Ronzone, Stenico, Calliano, Pellizzano and Riva del Garda. Some municipalities, such as Luserna, find their centroid in another territory because of their shape.

The plot depicts all those municipalities whose centroid is outside their actual territory. As can be seen, most of them have little zones outside the main one, while others (e.g. Predaia and Luserna) just have awkward shapes.

tn_coords = coordinates(tn)
tn@data$centroid = tn_coords
colorize = c()
for(i in 1:166){
  colorize = append(colorize, point.in.polygon(tn$centroid[i,1],
                                               tn$centroid[i,2],
                                               tn@polygons[i][[1]]@Polygons[[1]]@coords[,1],
                                               tn@polygons[i][[1]]@Polygons[[1]]@coords[,2]))
}
par(mar = c(0,0,0,0))
plot(tn, col = ifelse(colorize, "white","lightgrey"), border="grey")
text(tn_coords, labels = ifelse(colorize, "" ,tn@data$Comune), cex=0.6)

Instead of recurring to coordinates(), we could use gCentroid to obtain an alternative version of centroids. For most cases, these two versions coincide, but for those municipalities with multiple polygons points may differ (e.g. Soraga di Fassa in the upper right part or Tione di Trento). Since it computes a sort of mean point, making the centroid being part of other territories, the rest of the notebook will relate to the previous version (i.e. coordinates), despite inaccurate in some cases.

par(mar=c(0,0,0,0))
trueCentroids = gCentroid(tn,byid=TRUE)
plot(tn, border="grey")
points(coordinates(tn),pch=1, col="blue")
points(trueCentroids,pch=2, col="red")

⚠️ Note that, instead, the school dataset contains points and not polygons, therefore no problem occurs with the centroid computation. Also, since many schools have the same coordinates because, in the same building, the centroids will only consider unique points, reducing the dataset from 724 to 599 unique points.

K-Nearest Neighbour

Since there are various definitions of neighbourhood, we will try to explore three of them in the following sections, starting from K-nearest neighbour.

Since there is no way to choose a specific value for \(k\), we can iterate over a customized range, let’s say from 1 to 20 neighbours (i.e. schools). The following code generates an image for each value of \(k\).

# Saving schools coordinates
school_coords = coordinates(schools)
# Save frame per frame
for(i in 1:20){
    png(paste0("../viz/knn/",i,".png"),res=300, width=1000, height=1000)
    k <- knn2nb(knearneigh(school_coords, k = i, longlat=T))
    par(mar = c(0,0,0,0))
    plot(tn, border = "grey80", axis = tn, lwd=0.5)
    plot(k, school_coords, lwd=.6, col=alpha("#F27059",alpha=0.5), 
       cex = .3, add=TRUE, points=FALSE)
    dev.off()
}

⚠️ Note that through the function saveGIF()4 it would be possible to save frames as animation, lowering the resolution of the image obtained. Saving each frame will also allow users to select a customized value for \(k\) and look at how the map changes inside the website.

KNN on Trentino Schools

KNN on Trentino Schools

As \(k\) is increased, each school finds more and more neighbours, approaching also further points. If we focus on the first frame with \(k=1\), we may notice that some schools are isolated, since their closest neighbour requires long lines to be reached. An example is the municipality of Vermiglio, in the upper left part of Trentino, in the western boundary. There are, in fact, \(3\) schools in Vermiglio and one of them is near the boundary, far from other schools. A similar situation happens to Rabbi and Malé, to Luserna and Lavarone, but still Vermiglio is the municipality with the longest distance between schools within the local territory.

k <- knn2nb(knearneigh(school_coords, k = 1, longlat=T))
par(mar = c(0,0,0,0))
plot(tn, border = "grey80", axis = tn, lwd=0.5)
plot(k, school_coords, lwd=5, 
     col=alpha("#F27059",alpha=0.5), 
     cex = .8, add=TRUE, points=FALSE)
text(tn_coords, labels = ifelse(tn@data$Comune %in% c("Vermiglio","Rabbi",
                                                      "Malé","Luserna",
                                                      "Lavarone","Predaia"),
                                tn@data$Comune, ""), cex=0.6)

On the other hand, at the extreme opposite, with \(k=20\), we obtain the network of schools in Trentino, looking at the \(20\) closest schools around each point. Trento and Rovereto are the most intertwined zones, while green areas such as Adamello-Brenta Natural Park (west boundary) and Valsugana (empty zone in the right part of the map) lack in the number of schools. In fact, as can be seen by the length of lines, distances between schools tend to increase by moving from the Adige Valley to the east and west boundary.

KNN with k=20

KNN with k=20

If instead we focus on the municipalities, we can limit to a lower \(k\), since we talk about polygons with other territories at their boundary, while some schools’ points may result isolated with low values of \(k\). By looking at the \(k=5\) plot, we may notice how all municipalities are linked to each other, despite it does not seem so in the territory above Borgo Valsugana, where the Dolomites can be found.

knn1 = knn2nb(knearneigh(tn_coords, k = 1, longlat=T))
knn2 = knn2nb(knearneigh(tn_coords, k = 2, longlat=T))
knn3 = knn2nb(knearneigh(tn_coords, k = 3, longlat=T))
knn4 = knn2nb(knearneigh(tn_coords, k = 4, longlat=T))
knn5 = knn2nb(knearneigh(tn_coords, k = 5, longlat=T))

par(mar=c(0,0,0,0))
plot(tn, border = "grey80", axis = tn, lwd=0.5)
    plot(knn5, tn_coords, lwd=.6, col=alpha("#F27059",alpha=0.5), 
         cex = .3, add=TRUE, points=FALSE)

Critical cut-off

Our aim in this section will be to find the minimum threshold distance which allows all regions/points to have at least one neighbour. By setting \(k=1\) in the k-nearest neighbour, we can first compute the nearest neighbour to each school and the relative distance and then get the maximum distance among them.

# Critical cut-off on schools
knn1<- knn2nb(knearneigh(school_coords, k=1, longlat=T))
all.linked <- max(unlist(nbdists(knn1, school_coords, longlat=T))) 
all.linked
## [1] 9.182375

According to the results, all schools have a neighbour at least at 9.1823746 km. This implies that the cut-off distance has to be greater than it. However, notice from the following plot the distribution of school distances: the majority of them has a distance below the chilometer, following a long tail distribution. This may happen in big cities, as Trento and Rovereto, where there are a lot of schools and the minimum distance between them lowers.

distances = unlist(nbdists(knn1,school_coords,longlat=T))
ggplot()+
    geom_histogram(aes(x=distances), fill='#F27059', bins=50)+
    labs(title = "Distribution of distances between schools")+
  theme_minimal()

# Critical cut-off on municipalities
knn1<- knn2nb(knearneigh(tn_coords,k=1,longlat=T))
all.linked <- max(unlist(nbdists(knn1,tn_coords,longlat=T))) 
all.linked
## [1] 11.16466

We can repeat the same computation on municipalities centroids, discovering that every municipality has a neighbour at at least 11.1646609 km, slightly greater than the previous cut-off distance, which could mean that there are multiple schools in every municipality or that they are close enough to the boundary to be neighbour of other municipalities’ schools. Analyzing the distribution of distances between municipalities, we may notice that the majority of them distances from others from \(2\) to \(4\)km.

distances = unlist(nbdists(knn1,tn_coords,longlat=T))
ggplot()+
    geom_histogram(aes(x=distances), fill='#F27059', bins=50)+
    labs(title="Distribution of distances between municipalities' centroids")+
  theme_minimal()

We can try different neighbourhood definitions for different values of the cut-off distance, starting from the minimum threshold found before (i.e. \(9.18\)km).

dnb10 <- dnearneigh(school_coords, 0, 10, longlat=TRUE); dnb10
## Neighbour list object:
## Number of regions: 723 
## Number of nonzero links: 54618 
## Percentage nonzero weights: 10.44863 
## Average number of links: 75.54357
dnb15 <- dnearneigh(school_coords, 0, 15, longlat=TRUE); dnb15
## Neighbour list object:
## Number of regions: 723 
## Number of nonzero links: 93094 
## Percentage nonzero weights: 17.80923 
## Average number of links: 128.7607
dnb20 <- dnearneigh(school_coords, 0, 20, longlat=TRUE); dnb20
## Neighbour list object:
## Number of regions: 723 
## Number of nonzero links: 140366 
## Percentage nonzero weights: 26.85254 
## Average number of links: 194.1438
dnb25 <- dnearneigh(school_coords, 0, 25, longlat=TRUE); dnb25
## Neighbour list object:
## Number of regions: 723 
## Number of nonzero links: 191604 
## Percentage nonzero weights: 36.65456 
## Average number of links: 265.0124
dnb30 <- dnearneigh(school_coords, 0, 30, longlat=TRUE); dnb30
## Neighbour list object:
## Number of regions: 723 
## Number of nonzero links: 246372 
## Percentage nonzero weights: 47.13188 
## Average number of links: 340.7635

As the cut-off distance increases, the number of links grows rapidly. Based on the visualization, we could have stopped at 20, where nearly every school is connected to others.

plot_neighbour = function(model, coords, title){
    par(mar=c(0,0,1,0))
    plot(tn, border="grey",xlab="",ylab="",xlim=NULL)
    title(main=title, cex.main=0.8) 
    plot(model, coords, add=TRUE, col="#F27059", pch=16, lwd = 1, points=FALSE)
}

par(mfrow = c(3,2))
plot_neighbour(dnb10, school_coords, "d nearest neighbours, d = 10")
plot_neighbour(dnb15, school_coords, "d nearest neighbours, d = 15")
plot_neighbour(dnb20, school_coords, "d nearest neighbours, d = 20")
plot_neighbour(dnb25, school_coords, "d nearest neighbours, d = 25")
plot_neighbour(dnb30, school_coords, "d nearest neighbours, d = 30")

The same approach could be applied to municipalities data, obviously creating a network based on the territories around a certain area. Remembering that the cut-off threshold in this case is above \(11\), we can start with \(12\).

dnb12 <- dnearneigh(tn_coords, 0, 12, longlat=TRUE); dnb12
## Neighbour list object:
## Number of regions: 166 
## Number of nonzero links: 2000 
## Percentage nonzero weights: 7.257947 
## Average number of links: 12.04819
dnb16 <- dnearneigh(tn_coords, 0, 16, longlat=TRUE); dnb16
## Neighbour list object:
## Number of regions: 166 
## Number of nonzero links: 3314 
## Percentage nonzero weights: 12.02642 
## Average number of links: 19.96386
dnb20 <- dnearneigh(tn_coords, 0, 20, longlat=TRUE); dnb20
## Neighbour list object:
## Number of regions: 166 
## Number of nonzero links: 4866 
## Percentage nonzero weights: 17.65859 
## Average number of links: 29.31325
dnb24 <- dnearneigh(tn_coords, 0, 24, longlat=TRUE); dnb24
## Neighbour list object:
## Number of regions: 166 
## Number of nonzero links: 6712 
## Percentage nonzero weights: 24.35767 
## Average number of links: 40.43373
dnb30 <- dnearneigh(tn_coords, 0, 30, longlat=TRUE); dnb30
## Neighbour list object:
## Number of regions: 166 
## Number of nonzero links: 9652 
## Percentage nonzero weights: 35.02685 
## Average number of links: 58.14458
par(mfrow = c(3,2))
plot_neighbour(dnb12, tn_coords, "d nearest neighbours, d = 12")
plot_neighbour(dnb16, tn_coords, "d nearest neighbours, d = 16")
plot_neighbour(dnb20, tn_coords, "d nearest neighbours, d = 20")
plot_neighbour(dnb24, tn_coords, "d nearest neighbours, d = 24")
plot_neighbour(dnb30, tn_coords, "d nearest neighbours, d = 30")

Also in this case the number of connections grows rapidly, indicating how close the municipalities are between each other. Consider that the maximum distance within province of Trento between municipalities is around 120km (considering the centroids, therefore the actual distance is greater), but over the \(75\%\) of municipalities has an area below \(50km^2\), which allows them to be connected with brief distances.

# Quantiles of areas of municipalities within the Province of Trento
tn@data$area = round(area(tn)/ 1000000,3)
area = tn@data %>%
    arrange(desc(area)) %>%
    select(Comune, area) 
quantile(area$area)
##        0%       25%       50%       75%      100% 
##   1.65200  12.95100  25.57700  48.97075 199.55200
# Find max distance within the province centroids
library(geosphere)
diff = c()
for(i in 1:166 ){
  for (j in 1:166){
 diff = append(diff, distm(tn_coords[i,],tn_coords[j,], fun = distHaversine))
  }
}
# Maximum distance between centroids within the province of Trento
max(diff)/1000
## [1] 120.7196

Contiguity based approach

Since schools are depicted as points, we will use municipalities’ data to connect territories with common boundaries, i.e. multiple municipalities around. From the visualization, it is worth noticing the spiderweb created around the municipalities on the inside of Trentino, while those more disconnected are placed on the border of the Province, especially the territories on the upper right part of the map (e.g. Canazei, San Giovanni di Fassa, Mazzin, Campitello di Fassa, Soraga di Fassa, Moena).

par(mar=c(0,0,0,0))
contnb_q <- poly2nb(tn, queen=T)
plot(tn, border="grey")
plot(contnb_q, tn_coords, add=TRUE, col="#EF798A")
points(coordinates(tn), 
       col="red", 
       bg = "#EF798A", 
       pch = 21,  
       lwd = 1.5)

⚠️ Note that there are 166 municipalities in the Province of Trento. By sorting them according to the shape area, we get that the biggest areas do not share at least one boundary, since they take place on the border or in mountainous zones; while Trento occupies a central position.

area%>%
  DT::datatable()
Biggest municipalities in Trentino (in terms of area)

Biggest municipalities in Trentino (in terms of area)

Spatial Weights

After the definition of neighbourhoods, we can create spatial weights matrices, one for each neighbour list previously created.

# K-nearest neighbour
knn1.list = nb2listw(knn1)
knn2.list = nb2listw(knn2)
knn3.list = nb2listw(knn3)
knn4.list = nb2listw(knn4)
knn5.list = nb2listw(knn5)
# Critical cut-off
dnb12.list = nb2listw(dnb12,style="W")
dnb16.list = nb2listw(dnb16,style="W")
dnb20.list = nb2listw(dnb30,style="W")
dnb24.list = nb2listw(dnb24,style="W")
dnb30.list = nb2listw(dnb30,style="W")
# Contiguity based approach
contnb_q.list = nb2listw(contnb_q)

# List with weights lists and their names
weights = list(
    list(knn1.list, "K-nearest neighbour (k=1)"),
    list(knn2.list, "K-nearest neighbour (k=2)"),
    list(knn3.list, "K-nearest neighbour (k=3)"),
    list(knn4.list, "K-nearest neighbour (k=4)"),
    list(knn5.list, "K-nearest neighbour (k=5)"),
    list(dnb12.list, "Critical cut-off neighbourhood (d=12)"),
    list(dnb16.list, "Critical cut-off neighbourhood (d=16)"),
    list(dnb20.list, "Critical cut-off neighbourhood (d=20)"),
    list(dnb24.list, "Critical cut-off neighbourhood (d=24)"),
    list(dnb30.list, "Critical cut-off neighbourhood (d=30)"),
    list(contnb_q.list, "Contiguity-based neighbourhoord")
)

Moran’s I test of spatial autocorrelation

In the section, we will focus on Moran’s I test of spatial autocorrelation of Trentino Schools, in particular on the number of schools and students that populate every municipality. Let’s start plotting the distribution of some features over the territory, in terms of quantiles: light colours indicate low values, while reds indicate high values for a specific feature.

⚠️ Note that we do not hold data about students of all schools in Trentino, therefore some data might be missing. In these cases, the plots below will show the respective area in white.

cols = list(tn$Scuole,tn$Studenti, tn$Classi, tn$Media_stud_scuola,
            tn$Pop_under_20_Pop_tot, tn$Stud_Pop_under_20, tn$Media_stud_classe)
titles = c("Schools","Students","Classes","Mean of students per school",
           "Population under 20 over Total Population", 
           "Students over Population under 20", "Mean students per class")
colours <- colours_palette(10)

na.ignore = function(x){
  x[is.na(x)] <- -1
  return(x)
}

par(mfrow=c(4,2),mar = c(0,0,1.7,0))
for(i in 1:7){
    c = na.ignore(unlist(cols[i]))
    brks <- round(quantile(c, seq(0,1,0.1)), digits=3)
    plot(tn, col=ifelse(c==-1,
                        "#ffffff",
                        colours[findInterval(c, brks, all.inside=TRUE)]),
         main = titles[i], cex.main=2.5)
}

Now we can try to compute the Moran’s test based on all the previous definitions of neighborhood and with the previous features exposed, trying to find some spatial autocorrelation.

Neighbourhood = c()
Column = c()
Sd = c()
p_value = c()
Moran_I_statistic = c()
Mean = c()
Var = c()
Assumption = c()

# Iterate over columns
for(i in 1:length(cols)){
  # Iterate over neighbourhood
  for (w in weights) {
    c = na.zero(unlist(cols[i]))
    # Iterate over assumptions
    for (rand in c(T,F)) {
      Neighbourhood = append(Neighbourhood, w[[2]])
      res = moran.test(c, w[[1]], randomisation = rand)
      Column = append(Column, titles[i])
      Sd = append(Sd, round(as.numeric(res[[1]]),4))
      p_value = append(p_value, round(as.numeric(res[[2]]), 4))
      Moran_I_statistic = append(Moran_I_statistic, round(as.numeric(res[[3]][1]),4))
      Mean = append(Mean, round(as.numeric(res[[3]][2]),4))
      Var = append(Var, round(as.numeric(res[[3]][3]),4))
      if(rand) {
        Assumption = append(Assumption, "Randomization")
      }else{
        Assumption = append(Assumption, "Normality")
      }
    }
    Neighbourhood = append(Neighbourhood, w[[2]])
    res = moran.mc(c, w[[1]], nsim=999)
    Column = append(Column, titles[i])
    Sd = append(Sd,round(as.numeric(res[[1]]),4))
    p_value = append(p_value, round(res$p.value),4)
    Moran_I_statistic = append(Moran_I_statistic, round(res$statistic,4))
    Mean = append(Mean, "")
    Var = append(Var, "")
    Assumption = append(Assumption, res$method)
  }
}

# create df with results and show them
res_df = data.frame(Column, Neighbourhood, Moran_I_statistic, p_value,
                    Sd, Mean, Var, Assumption)

res_df %>%
  arrange(desc(abs(Moran_I_statistic)), p_value) %>%
  DT::datatable()

By ordering results according to the absolute value of the Moran’s I statistics, we can see that the highest statistics obtained are around the \([0.1789,0.1952]\) interval, got in all three assumptions. Albeit mean students per school seem to be the column with the highest Moran’s statistics, the p-value is not below the threshold of \(0.05\) for the majority of its observations. The mean of students per school proves to be significant with neighbourhoods knn \((k=5)\) and critical cut-off with \(d=16\).

On the other hand, the proportion of Population under 20 over the total population seems significant with every configuration, except for some knns neighbourhoods. However, Moran’s statistics are lower than the ones gathered considering the mean of students per school (below \(0.1\)).

Both these two columns show a positive spatial autocorrelation, while negative ones are associated with the number of Students over the Population under 20, with a low p-value and minimum value \(−0.1218\) for Moran’s I statistics.

res_df%>%
  group_by(Column) %>%
  summarise("Median Moran" = median(Moran_I_statistic), "Median p-value" =  median(p_value), 
            "Mean Moran" = round(mean(Moran_I_statistic),4), "Mean p-value" = round(mean(p_value),4)) %>%
  DT::datatable()

Considering an aggregated table with median and mean statistics and p-value, we can confirm that the columns with the highest statistics are:

The low mean and median p-values for Students corroborate the absence of spatial autocorrelation based on the number of students. Nearly the same happens to the number of Classes and Schools.

Moran’s I test of spatial autocorrelation in OLS residuals

Residuals test with Mean of students per class

Let’s start by modelling the mean students per class with all the remaining features we have explored. The summary shows that all predictors are meaningful, except for the population under 20 over the total population and the number of students.

LinearMean <- lm(Media_stud_classe ~ Stud_Pop_under_20+Scuole+Classi+Media_stud_scuola+Pop_under_20_Pop_tot+Studenti, tn, na.action = na.ignore)
summary(LinearMean)
## 
## Call:
## lm(formula = Media_stud_classe ~ Stud_Pop_under_20 + Scuole + 
##     Classi + Media_stud_scuola + Pop_under_20_Pop_tot + Studenti, 
##     data = tn, na.action = na.ignore)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.8239  -1.2867   0.1401   1.5105   8.4567 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           5.104501   1.491956   3.421 0.000792 ***
## Stud_Pop_under_20     6.089376   0.415496  14.656  < 2e-16 ***
## Scuole                0.923610   0.143476   6.437 1.37e-09 ***
## Classi               -0.403344   0.055234  -7.303 1.28e-11 ***
## Media_stud_scuola     0.042844   0.004195  10.214  < 2e-16 ***
## Pop_under_20_Pop_tot  5.217104  10.093143   0.517 0.605947    
## Studenti              0.014959   0.002785   5.371 2.74e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.369 on 159 degrees of freedom
## Multiple R-squared:  0.8922, Adjusted R-squared:  0.8881 
## F-statistic: 219.4 on 6 and 159 DF,  p-value: < 2.2e-16

With the step function, it is possible to simplify this model, considering only those features with high relevance, excluding those with no significance (i.e. Pop_under_20_Pop_tot).

# Searching for a simplified model where every feature has high significance
LinearMean = step(LinearMean)
## Start:  AIC=293.21
## Media_stud_classe ~ Stud_Pop_under_20 + Scuole + Classi + Media_stud_scuola + 
##     Pop_under_20_Pop_tot + Studenti
## 
##                        Df Sum of Sq     RSS    AIC
## - Pop_under_20_Pop_tot  1      1.50  893.96 291.49
## <none>                               892.46 293.21
## - Studenti              1    161.92 1054.38 318.89
## - Scuole                1    232.60 1125.06 329.66
## - Classi                1    299.32 1191.78 339.22
## - Media_stud_scuola     1    585.57 1478.03 374.96
## - Stud_Pop_under_20     1   1205.60 2098.06 433.11
## 
## Step:  AIC=291.49
## Media_stud_classe ~ Stud_Pop_under_20 + Scuole + Classi + Media_stud_scuola + 
##     Studenti
## 
##                     Df Sum of Sq     RSS    AIC
## <none>                            893.96 291.49
## - Studenti           1    160.59 1054.55 316.91
## - Scuole             1    240.91 1134.87 329.10
## - Classi             1    298.54 1192.50 337.32
## - Media_stud_scuola  1    677.89 1571.85 383.17
## - Stud_Pop_under_20  1   1204.65 2098.61 431.15
summary(LinearMean)
## 
## Call:
## lm(formula = Media_stud_classe ~ Stud_Pop_under_20 + Scuole + 
##     Classi + Media_stud_scuola + Studenti, data = tn, na.action = na.ignore)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.8010  -1.1654   0.1351   1.4944   8.4473 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        5.846845   0.403278  14.498  < 2e-16 ***
## Stud_Pop_under_20  6.077225   0.413879  14.684  < 2e-16 ***
## Scuole             0.932771   0.142051   6.566 6.85e-10 ***
## Classi            -0.402722   0.055094  -7.310 1.21e-11 ***
## Media_stud_scuola  0.043555   0.003954  11.015  < 2e-16 ***
## Studenti           0.014866   0.002773   5.361 2.85e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.364 on 160 degrees of freedom
## Multiple R-squared:  0.892,  Adjusted R-squared:  0.8887 
## F-statistic: 264.4 on 5 and 160 DF,  p-value: < 2.2e-16

The plot of the studentized residuals of the linear model can give us a hint about the presence of spatial dependence in the residuals.

par(mar=c(0,0,1,0))
studres <- rstudent(LinearMean)
resdistr <- round(quantile(studres, seq(0,1,0.25)), digits=3)
colours_5 <- colours_palette(5)
plot(tn, col=colours_5[findInterval(studres, resdistr, all.inside=TRUE)],
     main = "Residuals quantiles in Trentino",
     border="grey20")

The command that allows performing Moran’s I test in the OLS residuals is the function lm.morantest(). In the following chunk, the test to the studentized residuals of the linear Solow Model for different specifications of the spatial weights matrix. This method will be applied to all neighbourhoods definition, except KNN with \(k\) lower than \(4\) and the contiguity neighbourhood, since they return unusual results.

ols_res = data.frame(Neighbourhood = c(""),
                     Moran = c(""),
                     p_value = c(""))
# Moran test on residuals
for(i in 4:11){
  t = lm.morantest(LinearMean,weights[[i]][[1]],resfun=rstudent)
  ols_res = rbind(ols_res, c(weights[[i]][[2]],
                             t$estimate['Observed Moran I'],
                             t$p.value))
}
ols_res$Moran = as.numeric(ols_res$Moran)
ols_res$p_value = as.numeric(ols_res$p_value)
ols_res %>%
  arrange(desc(abs(Moran))) %>%
  filter(!is.na(Moran)) %>%
  DT::datatable()

The obtained results provide a negative spatial autocorrelation, but the p-value is far from being below the threshold to confirm this hypothesis. Therefore, no spatial autocorrelation is found in the residuals of this model.

Residuals test with Pop under 20 over total population

By focusing instead on the population under 20 over the total population for each municipality, we obtain from the summary that only the mean students per school is statistically significant.

LinearPop <- lm(tn$Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + Scuole + 
                  Classi + Media_stud_scuola + Media_stud_classe + Studenti, tn)
summary(LinearPop)
## 
## Call:
## lm(formula = tn$Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + Scuole + 
##     Classi + Media_stud_scuola + Media_stud_classe + Studenti, 
##     data = tn)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.049740 -0.010753  0.001237  0.010741  0.036054 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.389e-01  8.487e-03  16.367  < 2e-16 ***
## Stud_Pop_under_20 -3.945e-03  4.589e-03  -0.860  0.39156    
## Scuole             1.533e-03  1.154e-03   1.328  0.18647    
## Classi             2.348e-04  4.447e-04   0.528  0.59842    
## Media_stud_scuola  1.188e-04  4.321e-05   2.750  0.00684 ** 
## Media_stud_classe  4.199e-04  8.218e-04   0.511  0.61033    
## Studenti          -2.233e-05  2.094e-05  -1.066  0.28848    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01643 on 124 degrees of freedom
##   (35 osservazioni eliminate a causa di valori mancanti)
## Multiple R-squared:  0.1813, Adjusted R-squared:  0.1417 
## F-statistic: 4.576 on 6 and 124 DF,  p-value: 0.0003145

By applying the step function, the model is simplified to students, schools and mean of students per school. However, the adjusted r-squared is too low to guarantee the quality of this model.

# Searching for a simplified model where every feature has high significance
LinearPop = step(LinearPop)
## Start:  AIC=-1069.66
## tn$Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + Scuole + Classi + 
##     Media_stud_scuola + Media_stud_classe + Studenti
## 
##                     Df  Sum of Sq      RSS     AIC
## - Media_stud_classe  1 0.00007046 0.033543 -1071.4
## - Classi             1 0.00007526 0.033548 -1071.4
## - Stud_Pop_under_20  1 0.00019955 0.033672 -1070.9
## - Studenti           1 0.00030677 0.033779 -1070.5
## - Scuole             1 0.00047638 0.033949 -1069.8
## <none>                            0.033473 -1069.7
## - Media_stud_scuola  1 0.00204199 0.035515 -1063.9
## 
## Step:  AIC=-1071.39
## tn$Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + Scuole + Classi + 
##     Media_stud_scuola + Studenti
## 
##                     Df Sum of Sq      RSS     AIC
## - Classi             1 0.0000355 0.033579 -1073.2
## - Stud_Pop_under_20  1 0.0001498 0.033693 -1072.8
## - Studenti           1 0.0002511 0.033794 -1072.4
## <none>                           0.033543 -1071.4
## - Scuole             1 0.0007852 0.034328 -1070.3
## - Media_stud_scuola  1 0.0065399 0.040083 -1050.0
## 
## Step:  AIC=-1073.25
## tn$Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + Scuole + Media_stud_scuola + 
##     Studenti
## 
##                     Df Sum of Sq      RSS     AIC
## - Stud_Pop_under_20  1 0.0001149 0.033693 -1074.8
## <none>                           0.033579 -1073.2
## - Scuole             1 0.0009886 0.034567 -1071.5
## - Studenti           1 0.0012190 0.034798 -1070.6
## - Media_stud_scuola  1 0.0065150 0.040094 -1052.0
## 
## Step:  AIC=-1074.8
## tn$Pop_under_20_Pop_tot ~ Scuole + Media_stud_scuola + Studenti
## 
##                     Df Sum of Sq      RSS     AIC
## <none>                           0.033693 -1074.8
## - Scuole             1 0.0008972 0.034591 -1073.4
## - Studenti           1 0.0011299 0.034823 -1072.5
## - Media_stud_scuola  1 0.0067500 0.040443 -1052.9
summary(LinearPop)
## 
## Call:
## lm(formula = tn$Pop_under_20_Pop_tot ~ Scuole + Media_stud_scuola + 
##     Studenti, data = tn)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.049734 -0.010864  0.000615  0.010803  0.036535 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.421e-01  3.664e-03  38.783  < 2e-16 ***
## Scuole             1.778e-03  9.668e-04   1.839   0.0683 .  
## Media_stud_scuola  1.284e-04  2.546e-05   5.044 1.54e-06 ***
## Studenti          -1.183e-05  5.734e-06  -2.064   0.0411 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01629 on 127 degrees of freedom
##   (35 osservazioni eliminate a causa di valori mancanti)
## Multiple R-squared:  0.1759, Adjusted R-squared:  0.1564 
## F-statistic: 9.034 on 3 and 127 DF,  p-value: 1.818e-05
ols_res = data.frame(Neighbourhood = c(""),
                     Moran = c(""),
                     p_value = c(""))
# Moran test on residuals
for(i in 4:11){
  t = lm.morantest(LinearPop,weights[[i]][[1]],resfun=rstudent)

  ols_res = rbind(ols_res, c(weights[[i]][[2]],
                             t$estimate['Observed Moran I'],
                             t$p.value))
}
ols_res$Moran = as.numeric(ols_res$Moran)
ols_res$p_value = as.numeric(ols_res$p_value)
ols_res %>%
  arrange(desc(abs(Moran))) %>%
  filter(!is.na(Moran)) %>%
  DT::datatable()

Also in this case, the p-value is above \(0.05\), suggesting no spatial autocorrelation in the residuals. We can have a look at the plot to see how residuals are distributed all over the municipalities.

par(mar=c(0,0,1,0))
studres <- rstudent(LinearPop)
resdistr <- round(quantile(studres, seq(0,1,0.25)), digits=3)
plot(tn, col=colours_5[findInterval(studres, resdistr, all.inside=TRUE)],
     main = "Residuals quantiles in Trentino")

Local Analysis

Population under 20 over Total Population

Moran’s Scatterplot on Population under 20 over Total Population

The following grid of images shows Moran’s Scatterplot within all different notions of neighbourhoods. The membership of municipalities inside a quadrant or another changes according to the neighbourhood definition. For instance, Palù del Fersina and Sagron Mis often switch from LL to LH quadrant. Also, since this feature is a proportion that goes from \(0.08\) to \(0.19\), many municipalities overlap or assume the same value (columns of dots). Nevertheless, some outliers are noticeable, represented with a different style and their label.

As stated forehead, the quadrants crossed by the pink line and its confidence interval show the municipalities with similar proportion of students population over the total one (i.e. positive spatial autocorrelation). An example is the relationship between Novaledo and Porte di Rendena, Comano Terme and Dro, which are usually in the same HH quadrant and have a high proportion of students. On the opposite side, Cinte Tesino and Castello Tesino show a low value for students over the population.

On the other hand, municipalities on the remaining quadrants, have dissimilar values if they belong to opposite quadrants. For example, Primiero San Martino di Castrozza and Sagron Mis share a boundary, but their proportion of students is \(0.15\) and \(0.10\) respectively, enhancing a great dissimilarity between these two municipalities.

This chunk shows how many municipalities belong to a specific quadrant, based on the different concepts of neighbourhood. It seems like \(18\) or less over \(166\) municipalities have a spatial autocorrelation and belong to a specific quadrant. These hotspots are visualized with a colour, whose association is related to the membership to a specific quadrant, as described by the legend.

Since the membership to quadrants is divided over the three different concepts of neighbourhood, we can observe that the critical cut-off neighbourhood has no hotspot in the HH quadrant, which represents the majority of municipalities for the contiguity approach.

## quadrants.knn
##   HH   HL   LH   LL None 
##    2    5    3    8  148
## quadrants.dnb
##   HL   LH   LL None 
##    3    3    9  151
## quadrants.cont
##   HH   HL   LH   LL None 
##    3    4    4    2  153

To better visualize the position of the municipalities highlighted in previous scatterplots, the following map shows in white the municipalities with no quadrant and in colour the ones inside the Moran’s quadrants. In particular, this map will be served in three different configurations, one for each notion of the neighbourhood, choosing a random \(k\) for knn and \(d\) critical cut-off. This choice is because different neighbourhoods lead to different hotspots, i.e. the points that emerge over others in the Moran’s scatterplot.

⚠️ It is suggested to open the image in a new window to better read municipalities labels, whose size has been reduced for readability issues.

par(mfrow=c(3,1))
par(mar=c(0,0,0,0))
quadrants = list(list(quadrants.knn,"KNN"), list(quadrants.dnb,"Cut-off"), list(quadrants.cont,"Contiguity"))
for(l in quadrants){
  colourization = unlist(color_mapping[l[[1]]])
  par(mar=c(0,0,1,0))
  plot(tn, col=colourization, border = "grey", main=paste0("Regions with influence on students over population under 20 (neighbourhood = ",l[[2]],")"))
  legend(x=11.38, y=45.95, legend=c("Low-Low", "Low-High", "High-Low", "High-High","None"),
         fill=unlist(color_mapping), bty="n", cex=0.8)
  text(tn_coords, labels = ifelse(l[[1]]=="None", "" ,tn@data$Comune), cex=0.7)
}

As can be noticed, while critical cut-off and knn focus on the same eastern area of Trentino, the contiguity approach selects municipalities all over the province, in particular Trento and Rovereto, the most populated ones, and the least populated ones on the border.

While knn and contiguity show Novaledo, Dro and Ossana, KNN and Cut-off show municipalities like Mezzano (HL), Cinte Tesino (LL) and Palù del Fersina (LH) (and others not always highlighted before).

# Municipalities in common between KNN and cut-off
tn$Comune[quadrants.knn != "None" & quadrants.dnb != "None"]
## [1] "Canal San Bovo"      "Castello Tesino"     "Cinte Tesino"       
## [4] "Fai Della Paganella" "Imer"                "Mezzano"            
## [7] "Palù Del Fersina"    "Sagron Mis"          "Valfloriana"
# Novaledo in common in contiguity and knn
tn$Comune[quadrants.cont != "None" & quadrants.knn != "None"]
## [1] "Dro"      "Novaledo" "Ossana"
# Nothing in common between contiguity and cut-off
tn$Comune[quadrants.cont != "None" & quadrants.dnb != "None"]
## character(0)

Local Moran’s I

Local statistics can be tested for deviations using the hypothesis of absence of local spatial autocorrelation and hence can provide the statistical significance of the local spatial patterns detected by the Moran scatterplot. In particular, the function localmoran() provides:

  • Local Moran’s statistic \(I_i\);
  • the expectation of the statistic \(E(I_i)\);
  • the variance of the Local Moran’s statistic \(Var(I_i)\)
  • standard deviate of the Local Moran’s statistic \(Z_i\);
  • p-value of the Local Moran’s statistic \(P(z != E(I_i))\).

By sorting results according to the municipality with highest absolute value of the Local Moran statistic \(I_i\) and by filtering those whose p-value is below \(0.05\), we obtain the areas with highest statistical significance of the local spatial patterns, in a clearer way than the Moran scatterplot. Specifically, Sagron Mis, Cinte Tesino and Castello Tesino seem to be the municipalities with highest Local Moran’s, but comparing the top municipalities in lmI with those in the quadrants computed through the scatterplot, we may notice that only 35% of municipalities identified through the plot are spatially significant in the Local Moran’s.

lmI <- localmoran(tn$Pop_under_20_Pop_tot, dnb24.list, 
                  na.action = na.exclude)
lmI = data.frame(lmI) 
rownames(lmI) = tn$Comune

# List of municipalities ordered by Local Moran's I
# and below 0.05 p-value
lmI_sign = lmI%>%
  filter(lmI$Pr.z....E.Ii..<0.05) %>%
  arrange(desc(abs(Ii)))
DT::datatable(lmI_sign)
# Proportion of significant municipalities in Moran's Scatterplot
# versus those identified through Local Moran's
sum(c(tn$Comune[quadrants.knn != "None" | 
                                 quadrants.dnb != "None" | 
                                 quadrants.cont != "None"]) %in% rownames(lmI_sign))/dim(lmI_sign)[1]
## [1] 0.3461538

The following two plots show:

  • the territories where the Local Moran’s values are higher (darker colours) and lower (lighter colours). The labels are present only for those municipalities whose p-value is below \(0.05\). As can be noticed, labels are present in the extreme right of the Province of Trentino (e.g. Primiero San Martino di Castrozza, Castello Tesino, Sagron Mis etc), but also around the most populated areas (e.g. Riva del Garda and territories around Rovereto etc).
par(mar=c(0,0,1,0))
brks <- sort(as.numeric(lmI[,1]))
colours <- colorRampPalette(c('#fedb71','#dd696d'))(length(lmI[,1]))
plot(tn, col=colours[findInterval(lmI[,1], brks, all.inside=TRUE)],
     border="grey30")
title(main="Local Moran's I values")
text(tn_coords, labels = ifelse(tn@data$Comune %in% rownames(lmI_sign), tn@data$Comune ,""), cex=0.7)

  • This second plot only highlights the municipalities whose p-value is below the threshold of \(0.05\), allowing us to understand which areas have the lowest and the highest p-value and therefore the level of statistical significance of the spatial autocorrelation shown in the previous plot. Garniga Terme, Besenello, Cavalese, Mezzano, Ziano di Fiemme and Primiero San Martino di Castrozza are the municipalities with lowest p-value, but only Madruzzo and Besenello have a high value for the Local Moran’s \(I_i\), while others probably have a negative value for \(I_i\). Other territories around Riva del Garda, Rovereto and Castello Tesino have p-values around \(0.01\) and most of them positive spatial autocorrelation, except for Tesero, Terragnolo, Folgaria, Garniga Terme, Scurelle, Cavalese, Primiero San Martino di Castrozza, Mezzano, Ziano di Fiemme and Nomi (just order the table forehead showed in decreasing order with respect to lmI.
# Mapping the p-value as color
pvalue_colors = c("white","#fedb71", "#F6BF70", "#E6866E","#DD696D")
pval <- as.numeric(lmI[,5])
colpval = ifelse(pval>0.05, pvalue_colors[1],
                 ifelse(pval>0.01 & pval<=0.05, pvalue_colors[2],
                        ifelse(pval>0.001 & pval<=0.01,pvalue_colors[3],
                               ifelse(pval>0.0001,pvalue_colors[4],pvalue_colors[5])))) 
tn$colpval[pval>0.05] <- "white" 
tn$colpval[pval<=0.05 & pval>0.01] <- gray(0.9) 
tn$colpval[pval<=0.01 & pval>0.001] <- gray(0.7)
tn$colpval[pval<=0.001 & pval>0.0001] <- gray(0.4)
tn$colpval[pval<=0.0001] <- "black"

plot(tn, col=colpval)
legend(x=11.5, y=45.9, legend=c("Not significant", 
       "p-value = 0.05", "p-value = 0.01", "p-value = 0.001", 
       "p-value = 0.0001"), fill=pvalue_colors, bty="n", cex=1)
title(main="Local Moran's I significance map")
text(tn_coords, labels = ifelse(colpval == "white","", tn@data$Comune), cex=0.7)

Comparing the results obtained with the Scatterplots’, we can state that while the critical cut-off and knn found significance in the eastern part (e.g. Primiero, Predazzo, Castello Tesino etc), the contiguity approach helped to find significance near Riva del Garda and around Trento.

Mean of students per school

Moran’s Scatterplot

The following grid of images shows Moran’s Scatterplot within all different notions of neighbourhoods, considering the mean of students per school. As in the previous case, the pink line has a positive slope, whose confidence interval continues to grow by increasing the \(x\) value. Still, some differences can be noticed from one neighbourhood to another and some municipalities, like Mezzocorona, Villa Lagarina and San Michele All’Adige, continue to switch from above to below the mean dashed line.

In the HH quadrant, Trento, Rovereto, Mori, Mezzocorona, Avio, Lavis and Ala seem to be the municipalities with more students per school in the majority of plots, but also Brentonico, Giovo and Nago-Torbole, except when considering the critical cut-off neighbourhood.

On the LH part, communities like Drena (surrounded by Arco, Dro and Cavedine), Fornace (Baselga di Pinè and Civezzano), Ronzo-Chienis, Vallarsa (Ala, Rovereto) and Terragnolo (Rovereto, Folgaria) are those municipalities that have few students (less than \(100\) per school), but surrounded by municipalities with a lot of them.

This chunk shows how many municipalities belong to a specific quadrant, based on the different concepts of a neighbourhood. It seems like \(16\) or less over \(166\) municipalities have a spatial autocorrelation and belong to a specific quadrant. In particular, both knn and cut-off neighbourhoods do not provide hotspots for the LL quadrant, while it exceeds with HH municipalities, compared to contiguity neighbourhood.

table(quadrants.knn)
## quadrants.knn
##   HH   HL   LH None 
##   12    1    2  151
table(quadrants.cont)
## quadrants.cont
##   HH   HL   LH   LL None 
##    3    2    6    5  150
table(quadrants.dnb)
## quadrants.dnb
##   HH   HL   LH None 
##    6    2    3  155

The map displays the membership of municipalities to each of the quadrant in the Moran’s scatterplot, while in white leaving the remaining territories with no quadrant.

Starting with the KNN version, the Adige valley is completely in red, showing the spatial similarity of a high number of students per school: from Mezzocorona, to Trento, continuing to the neighbourhood of Rovereto and the bottom-centre part of Trentino.

The cut-off neighbourhood instead considers partially the same territories of KNN, with less focus on Rovereto and more on some LH municipalities, as Vallarsa, Trambileno and Terragnolo.

In the end, the contiguity approach detaches itself from previous approaches to show random municipalities, some in common with knn and cut-off (e.g. Roverè della Luna, Lavis, Trambileno, Nago-Torbole and Mori).

par(mfrow=c(3,1))
par(mar=c(0,0,0,0))
quadrants = list(list(quadrants.knn,"KNN"), list(quadrants.dnb,"Cut-off"), list(quadrants.cont,"Contiguity"))
for(l in quadrants){
  colourization = unlist(color_mapping[l[[1]]])
  par(mar=c(0,0,1,0))
  plot(tn, col=colourization, border = "grey", main=paste0("Regions with influence on mean number of students per school (neighbourhood = ",l[[2]],")"))
  legend(x=11.38, y=45.95, legend=c("Low-Low", "Low-High", "High-Low", "High-High","None"),
         fill=unlist(color_mapping), bty="n", cex=0.8)
  text(tn_coords, labels = ifelse(l[[1]]=="None", "" ,tn@data$Comune), cex=0.7)
}

# Municipalities in common between KNN and cut-off
tn$Comune[quadrants.knn != "None" & quadrants.dnb != "None"]
## [1] "Avio"                  "Lavis"                 "Levico Terme"         
## [4] "Mezzocorona"           "Mori"                  "San Michele All'Adige"
## [7] "Trento"                "Villa Lagarina"
# Municipalities in common in contiguity and knn
tn$Comune[quadrants.cont != "None" & quadrants.knn != "None"]
## [1] "Lavis"             "Mori"              "Nago-Torbole"     
## [4] "Roverè Della Luna"
# Municipalities in common between contiguity and cut-off
tn$Comune[quadrants.cont != "None" & quadrants.dnb != "None"]
## [1] "Lavis"      "Mori"       "Trambileno"

Local Moran’s I

This time, we will focus on the mean of students per school, whose values also include NAs that will be excluded through na.action parameter. As before, we create a dataframe ordered by the absolute value of Local Moran’s \(I_i\), filtered by the p-value below \(0.05\).

lmI <- localmoran(tn$Media_stud_scuola, dnb24.list, na.action = na.exclude)
lmI = data.frame(lmI) 
rownames(lmI) = tn$Comune

# Local Moran's I per each municipality
# below 0.05 p-value
lmI_sign = lmI%>%
  filter(lmI$Pr.z....E.Ii..<0.05) %>%
  arrange(desc(abs(Ii)))
DT::datatable(lmI_sign)
# Proportion of significant municipalities in Moran's Scatterplot
# versus those identified through Local Moran's
sum(c(tn$Comune[quadrants.knn != "None" | 
                                 quadrants.dnb != "None" | 
                                 quadrants.cont != "None"]) %in% rownames(lmI_sign))/dim(lmI_sign)[1]
## [1] 0.2926829

41 municipalities find a significant statistic for the local spatial autocorrelation and just \(29\%\) are in common with the ones found in the Moran scatterplots’ quadrants.

The following two plots show:

  • the territories where the Local Moran’s values are higher (darker colours) and lower (lighter colours). The labels are present only for those municipalities whose p-value is below \(0.05\). The white areas are those for which we do not detain students’ data. Most of the statistically significant municipalities are in the Valsugana or around the Rovereto/Riva del Garda zone.
par(mar=c(0,0,1,0))
brks <- sort(as.numeric(lmI[,1]))
colours <- colorRampPalette(c('#fedb71','#dd696d'))(length(lmI[,1]))
plot(tn, col=colours[findInterval(lmI[,1], brks, all.inside=TRUE)],
     border="grey30")
title(main="Local Moran's I values")
text(tn_coords, labels = ifelse(tn@data$Comune %in% rownames(lmI_sign), tn@data$Comune ,""), cex=0.7)

  • however, when looking at the p-value, Calliano, Terragnolo, Pomarolo, Volano and Nomi are the areas with the lowest p-value (therefore highest significance). All these municipalities can be found as Rovereto and Trento’s neighbours. However, Nomi and Terragnolo have a negative spatial autocorrelation (therefore fewer students per school with respect to their neighbours who have the highest values), while Pomarolo and Volano have higher local moran’s \(I_i\) than Calliano.
# Mapping the p-value as color
pvalue_colors = c("white","#fedb71", "#F6BF70", "#E6866E","#DD696D")
pval <- as.numeric(lmI[,5])
colpval = ifelse(pval>0.05, pvalue_colors[1],
                 ifelse(pval>0.01 & pval<=0.05, pvalue_colors[2],
                        ifelse(pval>0.001 & pval<=0.01,pvalue_colors[3],
                               ifelse(pval>0.0001,pvalue_colors[4],pvalue_colors[5])))) 
tn$colpval[pval>0.05] <- "white" 
tn$colpval[pval<=0.05 & pval>0.01] <- gray(0.9) 
tn$colpval[pval<=0.01 & pval>0.001] <- gray(0.7)
tn$colpval[pval<=0.001 & pval>0.0001] <- gray(0.4)
tn$colpval[pval<=0.0001] <- "black"

plot(tn, col=colpval)
legend(x=11.5, y=45.9, legend=c("Not significant", 
       "p-value = 0.05", "p-value = 0.01", "p-value = 0.001", 
       "p-value = 0.0001"), fill=pvalue_colors, bty="n", cex=1)
title(main="Local Moran's I significance map")
text(tn_coords, labels = ifelse(colpval == "white","", tn@data$Comune), cex=0.6)

These results deviate from those discovered through the scatterplot, since here two big clusters are evident, while in the plots with all neighbourhood definitions territories with spatial autocorrelation are quite sparse. Nevertheless, some territories are in common, especially in the Rovereto area.

Students over Population under 20

Moran’s Scatterplot

This last inquiry is due to the great gap discovered in the choropleth map produced through a python notebook that shows the proportion of actual students over the total population under 20 years old of every municipality. It seems in fact that some municipalities host more students than those who actually live in that area, leading to the possible conclusion that students may need to move from their city to another to go to school every day.

The remaining grid shows this time a negative trend of the relationship between x and the spatially lagged x, indicating that there are few close municipalities with a high number of students over the young population, but most of them share dissimilar values with their neighbours. This may be due to the problem highlighted before, related to the movement of students across different municipalities to go to school, especially middle, high and professional schools.

By assigning a quadrant to each municipality, we can easily access the number and the names of municipalities whose students come from other areas or that have the necessity of moving to go to school. In fact, as shown in the assignation of numbers to quadrants, the most populated areas are HL and LH, showing a trend of negative spatial autocorrelation.

table(quadrants.knn)
## quadrants.knn
##   HH   HL   LL None 
##    2    6    2  156
table(quadrants.cont)
## quadrants.cont
##   HH   HL   LH   LL None 
##    5    4    1    1  155
table(quadrants.dnb)
## quadrants.dnb
##   HH   HL   LH None 
##    2    6    1  157

In this case, the colour mapping has been changed to highlight the quadrants with most of the municipalities, whose values are dissimilar (i.e. HL in red, LH in yellow).

As in the previous two analyses of the scatterplot, knn and cut-off show more similarities than with the contiguity approach, thus pointing out the municipalities that host more students than those who actually live inside the territory (i.e. they welcome students from their neighbours).

The most critical situations are around Tione di Trento, Ossana and Cles, which have a high number of students over its under 20 population (over 2.3 times their actual young population). Near Ossana, the municipalities of Vermiglio, Peio and Rabbi register a density of students of 0.3, meaning that 60% of the young population has to move to Ossana to go to school.

The situation is even worst around Rovereto with Vallarsa, Trembileno and Terragnolo (more than 80% of young people goes to school somewhere else). Something similar but limited happens between Canazei and Giovanni di Fassa, since their closeness and the gap of students in the first against the abundance in the second may imply that students in Canazei move to Giovanni di Fassa to go to school.

color_mapping = list("LL" = "#F6Bf70",
                     "LH" = "#FEDB71",
                     "HL" = "#E0716E",
                     "HH" = "#E6866E",
                     "None" = "white")

par(mfrow=c(3,1))
par(mar=c(0,0,0,0))
quadrants = list(list(quadrants.knn,"KNN"),
                 list(quadrants.dnb,"Cut-off"),
                 list(quadrants.cont,"Contiguity"))

for(l in quadrants){
  colourization = unlist(color_mapping[l[[1]]])
  par(mar=c(0,0,1,0))
  plot(tn, col=colourization, border = "grey", 
       main=paste0("Regions with influence on students over population under 20 (neighbourhood = ",l[[2]],")"))
  legend(x=11.38, y=45.95, 
         legend=c("Low-Low", "Low-High", "High-Low", "High-High","None"),
         fill=unlist(color_mapping), bty="n", cex=0.8)
  text(tn_coords, labels = ifelse(l[[1]]=="None", "" ,tn@data$Comune), cex=0.7)
}

# Municipalities in common between KNN and cut-off
# (those that for sure host students from their neighbours)
tn$Comune[quadrants.knn != "None" & quadrants.dnb != "None"]
## [1] "Cles"            "Ossana"          "Tione Di Trento"
# Cinte Tesino in common in contiguity and knn
tn$Comune[quadrants.cont != "None" & quadrants.knn != "None"]
## character(0)
# Nothing in common between contiguity and cut-off
tn$Comune[quadrants.cont != "None" & quadrants.dnb != "None"]
## character(0)

Local’s Moran I

As in the previous section, also here not all municipalities detain data about students over the population under 20, therefore missing values are excluded. Since there are only few municipalities whose p-value is below \(0.05\) (i.e. Molveno, Cimone, Aldeno and Storo), the dataframe below accepts all rows whose p-value is below \(0.1\).

lmI <- localmoran(tn$Stud_Pop_under_20, dnb24.list, 
                  na.action = na.exclude)
lmI = data.frame(lmI) 
rownames(lmI) = tn$Comune

# Local Moran's I for each municipality
# filtered for those whose p-value is below 0.1
lmI_sign = lmI%>%
  filter(lmI$Pr.z....E.Ii..<0.1) %>%
  arrange(desc(abs(Ii)))
DT::datatable(lmI_sign)
# Proportion of significant municipalities in Moran's Scatterplot
# versus those identified through Local Moran's
sum(c(tn$Comune[quadrants.knn != "None" | 
                quadrants.dnb != "None" | 
                quadrants.cont != "None"]) %in%
      rownames(lmI_sign))/dim(lmI_sign)[1]
## [1] 0.125

16 municipalities find a significant statistic for the local spatial autocorrelation and just \(24\%\) are in common with the ones found in the Moran scatterplots’ quadrants.

The following two plots show:

  • the territories where the Local Moran’s values are higher (darker colours) and lower (lighter colours). As usual, the white areas represent missing values municipalities, while labels are indicated not for the zones with statistical significance for Local Moran’s I, but for those who show highest/lowest values for \(I_i\) (let’s say above \(0.2\) and below \(-0.2\)). In this way, territories like Tione di Trento, Ossana, Cles and Rovereto are highlighted as in the Moran’s scatterplot.
par(mar=c(0,0,1,0))
brks <- sort(as.numeric(lmI[,1]))
colours <- colorRampPalette(c('#fedb71','#dd696d'))(length(lmI[,1]))
plot(tn, col=colours[findInterval(lmI[,1], brks, all.inside=TRUE)],
     border="grey30")
title(main="Local Moran's I values")
text(tn_coords, labels = ifelse(lmI$Ii>=0.2 | lmI$Ii<=-0.2, tn@data$Comune ,""), cex=0.7)

  • however, looking at the p-value, we can notice how it is extremely high for all observations and therefore that no local spatial autocorrelation can be confirmed for the municipalities found in the previous plot, just for Molveno, Storo, Cimone and Aldeno. Given the absence of statistical significance, we could discard the hypothesis of students overcrowding in specific areas.
# Mapping the p-value as color
pvalue_colors = c("white","#fedb71", "#F6BF70", "#E6866E","#DD696D")
pval <- as.numeric(lmI[,5])
colpval = ifelse(pval>0.05, pvalue_colors[1],
                 ifelse(pval>0.01 & pval<=0.05, pvalue_colors[2],
                        ifelse(pval>0.001 & pval<=0.01,pvalue_colors[3],
                               ifelse(pval>0.0001,pvalue_colors[4],pvalue_colors[5])))) 
tn$colpval[pval>0.05] <- "white" 
tn$colpval[pval<=0.05 & pval>0.01] <- gray(0.9) 
tn$colpval[pval<=0.01 & pval>0.001] <- gray(0.7)
tn$colpval[pval<=0.001 & pval>0.0001] <- gray(0.4)
tn$colpval[pval<=0.0001] <- "black"

plot(tn, col=colpval, border="grey30")
legend(x=11.5, y=45.9, legend=c("Not significant", 
       "p-value = 0.05", "p-value = 0.01", "p-value = 0.001", 
       "p-value = 0.0001"), fill=pvalue_colors, bty="n", cex=1)
title(main="Local Moran's I significance map")
text(tn_coords, labels = ifelse(colpval == "white","", tn@data$Comune), cex=0.6)

Spatial Regression Models

Population under 20 over the total population

In this section, we will resume the conclusions obtained in the Moran’s I test of spatial autocorrelation in OLS residuals, in particular considering the Population under 20 over the total population model.

Let’s start with the SDM model. The lagsarlm()11 function provides Maximum likelihood estimation of spatial simultaneous autoregressive lag and spatial Durbin (mixed) models.

As shown in the summary, only the mean of students per school is a relevant predictor, none of the lag or the remaining predictors’ p-value is close to the \(0.05\) threshold to demonstrate statistical significance. Also, the overall p-value is above \(0.05\), therefore no significance is shown by the SDM.

library(spatialreg)

# Estimate the SDM model using the Maximum likelihood estimator
SDM <- lagsarlm(Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + Scuole +
                  Media_stud_scuola + Studenti, tn, listw=dnb16.list,
                type="mixed", na.action = na.ignore)
summary(SDM)
## 
## Call:lagsarlm(formula = Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + 
##     Scuole + Media_stud_scuola + Studenti, data = tn, listw = dnb16.list, 
##     na.action = na.ignore, type = "mixed")
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0627464 -0.0109416  0.0018043  0.0116222  0.0396180 
## 
## Type: mixed 
## Coefficients: (asymptotic standard errors) 
##                          Estimate  Std. Error z value  Pr(>|z|)
## (Intercept)            1.5640e-01  3.3312e-02  4.6949 2.668e-06
## Stud_Pop_under_20     -1.9432e-03  2.9864e-03 -0.6507    0.5152
## Scuole                 1.5599e-03  1.0880e-03  1.4338    0.1516
## Media_stud_scuola      1.2969e-04  3.2676e-05  3.9690 7.216e-05
## Studenti              -1.0379e-05  6.4067e-06 -1.6200    0.1052
## lag.Stud_Pop_under_20  3.5103e-05  1.3270e-02  0.0026    0.9979
## lag.Scuole            -5.2318e-03  4.0974e-03 -1.2769    0.2017
## lag.Media_stud_scuola  2.0009e-05  9.7257e-05  0.2057    0.8370
## lag.Studenti           3.2587e-05  2.4817e-05  1.3131    0.1892
## 
## Rho: -0.040499, LR test value: 0.037775, p-value: 0.84589
## Asymptotic standard error: 0.22107
##     z-value: -0.1832, p-value: 0.85464
## Wald statistic: 0.033561, p-value: 0.85464
## 
## Log likelihood: 430.3369 for mixed model
## ML residual variance (sigma squared): 0.00032791, (sigma: 0.018108)
## Number of observations: 166 
## Number of parameters estimated: 11 
## AIC: -838.67, (AIC for lm: -840.64)
## LM test for residual autocorrelation
## test value: 0.53283, p-value: 0.46542

Same principle applies to SAR, which has an even higher p-value than SDM, indicating the absence of statistical significance. The relevance of predictors remains unaltered, despite students and students over the young population are boundary situations (slightly above \(0.05\)).

# Estimate the SAR model using the Maximum likelihood estimator
SAR <- lagsarlm(Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + Scuole +
                  Media_stud_scuola + Studenti, tn, listw=dnb16.list, 
                na.action = na.ignore)
summary(SAR)
## 
## Call:lagsarlm(formula = Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + 
##     Scuole + Media_stud_scuola + Studenti, data = tn, listw = dnb16.list, 
##     na.action = na.ignore)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0642265 -0.0109883  0.0020149  0.0121413  0.0363773 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##                      Estimate  Std. Error z value  Pr(>|z|)
## (Intercept)        1.4054e-01  2.9817e-02  4.7134 2.437e-06
## Stud_Pop_under_20 -1.9183e-03  2.8888e-03 -0.6641   0.50665
## Scuole             1.8322e-03  1.0652e-03  1.7201   0.08542
## Media_stud_scuola  1.3440e-04  3.0496e-05  4.4073 1.047e-05
## Studenti          -1.2089e-05  6.2508e-06 -1.9339   0.05312
## 
## Rho: 0.012058, LR test value: 0.0042347, p-value: 0.94811
## Asymptotic standard error: 0.1932
##     z-value: 0.06241, p-value: 0.95024
## Wald statistic: 0.003895, p-value: 0.95024
## 
## Log likelihood: 429.3008 for lag model
## ML residual variance (sigma squared): 0.00033206, (sigma: 0.018223)
## Number of observations: 166 
## Number of parameters estimated: 7 
## AIC: -844.6, (AIC for lm: -846.6)
## LM test for residual autocorrelation
## test value: 0.032182, p-value: 0.85763

Now we skip to the local spillover specifications, starting from SDEM, which shows no significance in terms of p-value, both of the model and about the predictors (except for the mean students per school).

# Estimate the SDEM model using the Maximum likelihood estimator
SDEM <- errorsarlm(Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + Scuole +
                     Media_stud_scuola + Studenti, tn, listw=dnb16.list,
                   etype = "emixed", na.action = na.ignore)
summary(SDEM)
## 
## Call:errorsarlm(formula = Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + 
##     Scuole + Media_stud_scuola + Studenti, data = tn, listw = dnb16.list, 
##     na.action = na.ignore, etype = "emixed")
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0627857 -0.0108818  0.0018071  0.0116223  0.0395947 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##                          Estimate  Std. Error z value  Pr(>|z|)
## (Intercept)            1.5045e-01  8.5569e-03 17.5828 < 2.2e-16
## Stud_Pop_under_20     -1.9402e-03  2.9856e-03 -0.6499    0.5158
## Scuole                 1.5784e-03  1.0839e-03  1.4562    0.1453
## Media_stud_scuola      1.2962e-04  3.2701e-05  3.9636 7.382e-05
## Studenti              -1.0489e-05  6.3811e-06 -1.6438    0.1002
## lag.Stud_Pop_under_20  1.1819e-04  1.3107e-02  0.0090    0.9928
## lag.Scuole            -5.1901e-03  4.0476e-03 -1.2823    0.1998
## lag.Media_stud_scuola  1.3856e-05  9.0424e-05  0.1532    0.8782
## lag.Studenti           3.2381e-05  2.4537e-05  1.3197    0.1870
## 
## Lambda: -0.029951, LR test value: 0.01977, p-value: 0.88818
## Asymptotic standard error: 0.22105
##     z-value: -0.1355, p-value: 0.89222
## Wald statistic: 0.01836, p-value: 0.89222
## 
## Log likelihood: 430.3279 for error model
## ML residual variance (sigma squared): 0.00032796, (sigma: 0.01811)
## Number of observations: 166 
## Number of parameters estimated: 11 
## AIC: -838.66, (AIC for lm: -840.64)

The same happens for the SEM, except for the high relevance of the predictor “mean of students per school”.

# Estimate the SEM model using the Maximum likelihood estimator
SEM <- errorsarlm(Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + Scuole +
                    Media_stud_scuola + Studenti, tn, listw=dnb16.list, na.action = na.ignore)
summary(SEM)
## 
## Call:errorsarlm(formula = Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + 
##     Scuole + Media_stud_scuola + Studenti, data = tn, listw = dnb16.list, 
##     na.action = na.ignore)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0642065 -0.0109760  0.0020477  0.0121882  0.0363688 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##                      Estimate  Std. Error z value  Pr(>|z|)
## (Intercept)        1.4239e-01  3.0860e-03 46.1419 < 2.2e-16
## Stud_Pop_under_20 -1.9404e-03  2.8732e-03 -0.6753   0.49947
## Scuole             1.8263e-03  1.0649e-03  1.7150   0.08634
## Media_stud_scuola  1.3481e-04  3.0092e-05  4.4799 7.468e-06
## Studenti          -1.2053e-05  6.2492e-06 -1.9288   0.05376
## 
## Lambda: -0.0017038, LR test value: 6.7401e-05, p-value: 0.99345
## Asymptotic standard error: 0.21767
##     z-value: -0.0078273, p-value: 0.99375
## Wald statistic: 6.1267e-05, p-value: 0.99375
## 
## Log likelihood: 429.2987 for error model
## ML residual variance (sigma squared): 0.00033207, (sigma: 0.018223)
## Number of observations: 166 
## Number of parameters estimated: 7 
## AIC: -844.6, (AIC for lm: -846.6)

On the other hand, the LDM shows the same relevance for predictors as SEM, SAR and SDM, but a low p-value, demonstrating how significant the estimation is.

# Estimate the LDM model using the OLS estimator
LDM <- lmSLX(Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + Scuole +
               Media_stud_scuola + Studenti,tn, listw=dnb16.list, na.action = na.ignore)
summary(LDM)
## 
## Call:
## lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1], collapse = "+"))), 
##     data = as.data.frame(x), weights = weights)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.062864 -0.010759  0.001845  0.011685  0.039588 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            1.505e-01  8.931e-03  16.848  < 2e-16 ***
## Stud_Pop_under_20     -1.933e-03  3.071e-03  -0.629 0.530002    
## Scuole                 1.588e-03  1.115e-03   1.424 0.156535    
## Media_stud_scuola      1.296e-04  3.361e-05   3.855 0.000168 ***
## Studenti              -1.055e-05  6.568e-06  -1.606 0.110304    
## lag.Stud_Pop_under_20  3.293e-04  1.364e-02   0.024 0.980766    
## lag.Scuole            -5.236e-03  4.213e-03  -1.243 0.215756    
## lag.Media_stud_scuola  1.424e-05  9.374e-05   0.152 0.879431    
## lag.Studenti           3.257e-05  2.552e-05   1.276 0.203697    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01862 on 157 degrees of freedom
## Multiple R-squared:  0.189,  Adjusted R-squared:  0.1477 
## F-statistic: 4.573 on 8 and 157 DF,  p-value: 5e-05

The calculation of impacts for spatial lag and spatial Durbin models is needed in order to interpret the regression coefficients correctly, because of the spillovers between the terms in these data generation processes12 13.

The method impacts returns the average direct, indirect and total impacts for the variables in the model, for the variables themselves in the spatial lag model case, for the variables and their spatial lags in the spatial Durbin (mixed) model case14. The indirect impact represents a proper measure of (global) spatial spillover. Indeed, it expresses, on average, how a change in \(x\) reverberates on \(y\) all over the sample.

SAR impacts on the mean of students per school are significant due to a low p-value. On average, because of spatial spillovers, a change in a neighbourhood would lead to a cumulative increase of \(0.0001360426\) (in SAR case) in the ratio of the population under 20 over the total population.

impSAR <- impacts(SAR, listw=dnb16.list, R=100)
summary(impSAR, zstats=TRUE, short=TRUE)
## Impact measures (lag, exact):
##                          Direct      Indirect         Total
## Stud_Pop_under_20 -1.918366e-03 -2.339570e-05 -0.0019417614
## Scuole             1.832210e-03  2.234498e-05  0.0018545548
## Media_stud_scuola  1.344034e-04  1.639136e-06  0.0001360426
## Studenti          -1.208867e-05 -1.474291e-07 -0.0000122361
## ========================================================
## Simulation results ( variance matrix):
## ========================================================
## Simulated standard errors
##                         Direct     Indirect        Total
## Stud_Pop_under_20 3.202421e-03 5.901830e-04 3.174162e-03
## Scuole            1.105073e-03 5.160253e-04 1.248360e-03
## Media_stud_scuola 3.015550e-05 3.036230e-05 4.030213e-05
## Studenti          6.389843e-06 3.354847e-06 7.482681e-06
## 
## Simulated z-values:
##                       Direct   Indirect      Total
## Stud_Pop_under_20 -0.5162212  0.1134413 -0.4997244
## Scuole             1.7180983  0.1386092  1.5781901
## Media_stud_scuola  4.4205826  0.1185988  3.3969871
## Studenti          -1.9472922 -0.1477848 -1.7291512
## 
## Simulated p-values:
##                   Direct     Indirect Total     
## Stud_Pop_under_20 0.605700   0.90968  0.61726917
## Scuole            0.085779   0.88976  0.11452194
## Media_stud_scuola 9.8435e-06 0.90559  0.00068132
## Studenti          0.051500   0.88251  0.08378204
#For the SD specification:
impSDM <- impacts(SDM, listw=dnb16.list, R=100)
summary(impSDM, zstats=TRUE, short=TRUE)
## Impact measures (mixed, exact):
##                          Direct      Indirect         Total
## Stud_Pop_under_20 -1.943515e-03  1.096480e-04 -1.833867e-03
## Scuole             1.572671e-03 -5.101625e-03 -3.528954e-03
## Media_stud_scuola  1.296571e-04  1.421741e-05  1.438745e-04
## Studenti          -1.045835e-05  3.180209e-05  2.134374e-05
## ========================================================
## Simulation results ( variance matrix):
## ========================================================
## Simulated standard errors
##                         Direct     Indirect        Total
## Stud_Pop_under_20 2.857111e-03 1.243734e-02 1.259394e-02
## Scuole            1.153693e-03 4.390650e-03 4.536747e-03
## Media_stud_scuola 3.551301e-05 9.545531e-05 9.236712e-05
## Studenti          6.777282e-06 2.614804e-05 2.715157e-05
## 
## Simulated z-values:
##                       Direct    Indirect       Total
## Stud_Pop_under_20 -0.6610994  0.13418199 -0.01746614
## Scuole             1.2623765 -1.12081663 -0.76370107
## Media_stud_scuola  3.7300154  0.01641076  1.45106370
## Studenti          -1.4534234  1.17652243  0.77024967
## 
## Simulated p-values:
##                   Direct     Indirect Total  
## Stud_Pop_under_20 0.50854857 0.89326  0.98606
## Scuole            0.20681335 0.26237  0.44505
## Media_stud_scuola 0.00019147 0.98691  0.14676
## Studenti          0.14610624 0.23939  0.44115

The Lagrange multiplier (LM) test of spatial dependence on OLS residuals. In the LM test, the alternative hypothesis is explicitly considered to contrast the null hypothesis about the absence of spatial dependence. In particular, we can explicitly express the alternative hypothesis either in the form of a SL or of a SEM.

The function lm.LMtests() reports the estimates of tests chosen among 4 statistics for testing for spatial dependence in linear models, which are:

  • LMerr: LM test for error dependence;
  • LMlag, simple LM test for a missing spatially lagged dependent variable;
  • their robust version, called RLMerr and RLMlag.

According to this diagnostic, no model has been found significant by looking at the p-value.

OLSmodel <- lm(Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + Scuole + 
                 Media_stud_scuola + Studenti, data = tn, na.action = na.ignore)
natOLSlmTests <- lm.LMtests(OLSmodel, dnb16.list, 
                    test=c("LMerr", "LMlag", "RLMerr", "RLMlag"))
summary(natOLSlmTests)
##  Lagrange multiplier diagnostics for spatial dependence
## data:  
## model: lm(formula = Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + Scuole +
## Media_stud_scuola + Studenti, data = tn, na.action = na.ignore)
## weights: dnb16.list
##  
##         statistic parameter p.value
## LMerr  7.4034e-05         1  0.9931
## LMlag  4.6569e-03         1  0.9456
## RLMerr 2.4200e-02         1  0.8764
## RLMlag 2.8783e-02         1  0.8653

In conclusion, a positive significant total impact on the proportion of population under 20 over the total population has been found through the mean number of students per school, implying that higher levels of potential students in the territory lead to the increase of mean students per school in the same neighbourhood.

Mean of students per class

Again, we can repeat the same analysis with the mean of students per class, using the same predictors used in the OLS residuals section (i.e. optimal model).

Starting from SDM, all predictors seem to gain high relevance given the low p-value, while their lagged values are relevant only for the mean students per school and slightly for the number of classes. The overall p-value of the model is below \(0.05\) and is therefore considered statistically significant.

SDM <- lagsarlm(Media_stud_classe ~ Stud_Pop_under_20 + Scuole + Classi + 
                  Media_stud_scuola + Studenti, tn, listw=dnb16.list,
                type="mixed", na.action = na.ignore)
summary(SDM)
## 
## Call:lagsarlm(formula = Media_stud_classe ~ Stud_Pop_under_20 + Scuole + 
##     Classi + Media_stud_scuola + Studenti, data = tn, listw = dnb16.list, 
##     na.action = na.ignore, type = "mixed")
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -10.46096  -1.37375   0.13997   1.23705   8.03899 
## 
## Type: mixed 
## Coefficients: (asymptotic standard errors) 
##                         Estimate Std. Error z value  Pr(>|z|)
## (Intercept)            7.2396823  1.7232603  4.2012 2.656e-05
## Stud_Pop_under_20      6.2508896  0.4025277 15.5291 < 2.2e-16
## Scuole                 0.8868602  0.1384118  6.4074 1.480e-10
## Classi                -0.4000850  0.0533333 -7.5016 6.306e-14
## Media_stud_scuola      0.0403896  0.0040817  9.8952 < 2.2e-16
## Studenti               0.0150002  0.0026879  5.5806 2.397e-08
## lag.Stud_Pop_under_20  2.4979235  2.3014873  1.0854   0.27777
## lag.Scuole             0.8718620  0.5773390  1.5101   0.13101
## lag.Classi            -0.5051890  0.2794296 -1.8079   0.07062
## lag.Media_stud_scuola  0.0391793  0.0167894  2.3336   0.01962
## lag.Studenti           0.0211662  0.0135638  1.5605   0.11864
## 
## Rho: -0.48062, LR test value: 3.3873, p-value: 0.0657
## Asymptotic standard error: 0.25674
##     z-value: -1.872, p-value: 0.061205
## Wald statistic: 3.5044, p-value: 0.061205
## 
## Log likelihood: -370.3206 for mixed model
## ML residual variance (sigma squared): 5.011, (sigma: 2.2385)
## Number of observations: 166 
## Number of parameters estimated: 13 
## AIC: 766.64, (AIC for lm: 768.03)
## LM test for residual autocorrelation
## test value: 7.016, p-value: 0.0080785

In SAR case instead, despite all predictors seem to have high relevance, the spatial autoregressive parameter \(\rho\) (Rho) is not significant due to high p-value on both the asymptotic t-test and Likelihood ratio test, but the LM test for residual autocorrelation has low p-value, indicating the presence of spatial autocorrelation in the residuals.

SAR <- lagsarlm(Media_stud_classe ~ Stud_Pop_under_20 + Scuole + Classi + 
                  Media_stud_scuola + Studenti, tn, listw=dnb16.list, na.action = na.ignore)
summary(SAR)
## 
## Call:lagsarlm(formula = Media_stud_classe ~ Stud_Pop_under_20 + Scuole + 
##     Classi + Media_stud_scuola + Studenti, data = tn, listw = dnb16.list, 
##     na.action = na.ignore)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -10.873538  -1.269800   0.085934   1.360694   8.269603 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##                     Estimate Std. Error z value  Pr(>|z|)
## (Intercept)        4.7313801  0.9681310  4.8871 1.023e-06
## Stud_Pop_under_20  6.1795749  0.4099071 15.0755 < 2.2e-16
## Scuole             0.9236421  0.1386949  6.6595 2.747e-11
## Classi            -0.4000100  0.0538058 -7.4343 1.050e-13
## Media_stud_scuola  0.0418992  0.0040209 10.4204 < 2.2e-16
## Studenti           0.0147771  0.0027073  5.4583 4.808e-08
## 
## Rho: 0.10558, LR test value: 1.8654, p-value: 0.172
## Asymptotic standard error: 0.082328
##     z-value: 1.2824, p-value: 0.19969
## Wald statistic: 1.6446, p-value: 0.19969
## 
## Log likelihood: -374.3558 for lag model
## ML residual variance (sigma squared): 5.3213, (sigma: 2.3068)
## Number of observations: 166 
## Number of parameters estimated: 8 
## AIC: 764.71, (AIC for lm: 764.58)
## LM test for residual autocorrelation
## test value: 3.4037, p-value: 0.065052

In case of local spillover specifications, all predictors are relevant, but not their lagged version. The \(\lambda\) (Lambda) estimator is highly significant due to low p-value on both Likelihood ratio and asymptotic t-tests.

SDEM <- errorsarlm(Media_stud_classe ~ Stud_Pop_under_20 + Scuole + 
                     Classi + Media_stud_scuola + Studenti, tn, listw=dnb16.list,
                   etype = "emixed", na.action = na.ignore)
summary(SDEM)
## 
## Call:errorsarlm(formula = Media_stud_classe ~ Stud_Pop_under_20 + 
##     Scuole + Classi + Media_stud_scuola + Studenti, data = tn, 
##     listw = dnb16.list, na.action = na.ignore, etype = "emixed")
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -10.131534  -1.433884   0.091931   1.202291   7.824475 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##                         Estimate Std. Error z value  Pr(>|z|)
## (Intercept)            4.6960331  0.7869333  5.9675 2.409e-09
## Stud_Pop_under_20      6.2832091  0.4057626 15.4849 < 2.2e-16
## Scuole                 0.8754070  0.1388889  6.3029 2.921e-10
## Classi                -0.3940446  0.0529251 -7.4453 9.681e-14
## Media_stud_scuola      0.0389716  0.0041323  9.4309 < 2.2e-16
## Studenti               0.0147465  0.0026692  5.5248 3.299e-08
## lag.Stud_Pop_under_20 -0.2078575  1.2778245 -0.1627   0.87078
## lag.Scuole             0.4022850  0.4078400  0.9864   0.32395
## lag.Classi            -0.3291295  0.2278508 -1.4445   0.14860
## lag.Media_stud_scuola  0.0170222  0.0102905  1.6542   0.09809
## lag.Studenti           0.0149747  0.0112332  1.3331   0.18251
## 
## Lambda: -0.73637, LR test value: 6.2288, p-value: 0.012569
## Asymptotic standard error: 0.27499
##     z-value: -2.6778, p-value: 0.00741
## Wald statistic: 7.1708, p-value: 0.00741
## 
## Log likelihood: -368.8999 for error model
## ML residual variance (sigma squared): 4.8531, (sigma: 2.203)
## Number of observations: 166 
## Number of parameters estimated: 13 
## AIC: 763.8, (AIC for lm: 768.03)

In case of SEM instead, the estimator is not statistically significant in both tests due to p-value around \(0.44\).

SEM <- errorsarlm(Media_stud_classe ~ Stud_Pop_under_20 + Scuole + 
                    Classi + Media_stud_scuola + Studenti, tn, listw=dnb16.list, 
                  na.action = na.ignore)
summary(SEM)
## 
## Call:errorsarlm(formula = Media_stud_classe ~ Stud_Pop_under_20 + 
##     Scuole + Classi + Media_stud_scuola + Studenti, data = tn, 
##     listw = dnb16.list, na.action = na.ignore)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -10.65336  -1.27924   0.12887   1.56630   8.46921 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##                     Estimate Std. Error z value  Pr(>|z|)
## (Intercept)        5.7471139  0.3762679 15.2740 < 2.2e-16
## Stud_Pop_under_20  6.0255400  0.4024695 14.9714 < 2.2e-16
## Scuole             0.9477568  0.1390736  6.8148 9.441e-12
## Classi            -0.4050242  0.0540458 -7.4941 6.684e-14
## Media_stud_scuola  0.0446944  0.0037651 11.8706 < 2.2e-16
## Studenti           0.0148997  0.0027231  5.4715 4.462e-08
## 
## Lambda: -0.18161, LR test value: 0.52462, p-value: 0.46888
## Asymptotic standard error: 0.23753
##     z-value: -0.7646, p-value: 0.44451
## Wald statistic: 0.58461, p-value: 0.44451
## 
## Log likelihood: -375.0262 for error model
## ML residual variance (sigma squared): 5.3582, (sigma: 2.3148)
## Number of observations: 166 
## Number of parameters estimated: 8 
## AIC: 766.05, (AIC for lm: 764.58)

Also for LDM predictors are relevant, except for their lagged values. The p-value is below the \(0.05\) threshold, concluding the significance of this model.

LDM <- lmSLX(Media_stud_classe ~ Stud_Pop_under_20 + Scuole + 
               Classi + Media_stud_scuola + Studenti,tn, listw=dnb16.list, 
             na.action = na.ignore)
summary(LDM)
## 
## Call:
## lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1], collapse = "+"))), 
##     data = as.data.frame(x), weights = weights)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.7540  -1.2978   0.0067   1.2361   8.0782 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            4.863070   1.150983   4.225 4.06e-05 ***
## Stud_Pop_under_20      6.234265   0.423406  14.724  < 2e-16 ***
## Scuole                 0.880004   0.145556   6.046 1.07e-08 ***
## Classi                -0.403215   0.056095  -7.188 2.61e-11 ***
## Media_stud_scuola      0.040413   0.004287   9.427  < 2e-16 ***
## Studenti               0.015188   0.002827   5.372 2.81e-07 ***
## lag.Stud_Pop_under_20 -0.552015   1.772209  -0.311    0.756    
## lag.Scuole             0.397542   0.545942   0.728    0.468    
## lag.Classi            -0.311573   0.270806  -1.151    0.252    
## lag.Media_stud_scuola  0.015488   0.011944   1.297    0.197    
## lag.Studenti           0.013944   0.013606   1.025    0.307    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.355 on 155 degrees of freedom
## Multiple R-squared:  0.8962, Adjusted R-squared:  0.8895 
## F-statistic: 133.8 on 10 and 155 DF,  p-value: < 2.2e-16

All predictors’ impacts are significant, if looking at direct and total impacts. Results show a particular significance for:

  • the number of students over the population under 20: on average, because of spatial spillovers, a change in a neighbourhood would lead to a cumulative increase of nearly \(7\) students per class;
  • the number of classes: on average, because of spatial spillovers, a change in a neighbourhood would also lead to a cumulative decrease of \(0.45\) (half student per class);
  • the number of schools: on average, because of spatial spillovers, a change in a neighbourhood would also lead to a cumulative increase of slightly more than \(1\) (nearly one student per class).

Also, a cumulative increase will happen by changing neighbourhoods if considering an increase in terms of the mean of students per school and number of students, with lower impact than the previous three mentioned predictors.

impSAR <- impacts(SAR, listw=dnb16.list, R=100)
summary(impSAR, zstats=TRUE, short=TRUE)
## Impact measures (lag, exact):
##                        Direct     Indirect       Total
## Stud_Pop_under_20  6.18405109  0.724978490  6.90902958
## Scuole             0.92431115  0.108360311  1.03267146
## Classi            -0.40029979 -0.046928580 -0.44722837
## Media_stud_scuola  0.04192952  0.004915548  0.04684507
## Studenti           0.01478784  0.001733631  0.01652147
## ========================================================
## Simulation results ( variance matrix):
## ========================================================
## Simulated standard errors
##                        Direct    Indirect       Total
## Stud_Pop_under_20 0.442176257 0.592203301 0.886915186
## Scuole            0.134286262 0.079912089 0.149830619
## Classi            0.047158179 0.037836209 0.064096044
## Media_stud_scuola 0.004289637 0.003803754 0.004620418
## Studenti          0.002478835 0.001464724 0.003185327
## 
## Simulated z-values:
##                      Direct  Indirect     Total
## Stud_Pop_under_20 13.956471  1.130183  7.712708
## Scuole             6.749132  1.155522  6.665232
## Classi            -8.546666 -1.129278 -6.954763
## Media_stud_scuola  9.907408  1.135666 10.133060
## Studenti           6.065368  1.103835  5.227677
## 
## Simulated p-values:
##                   Direct     Indirect Total     
## Stud_Pop_under_20 < 2.22e-16 0.25840  1.2212e-14
## Scuole            1.4873e-11 0.24788  2.6425e-11
## Classi            < 2.22e-16 0.25878  3.5316e-12
## Media_stud_scuola < 2.22e-16 0.25610  < 2.22e-16
## Studenti          1.3165e-09 0.26966  1.7165e-07

In this second case, the p-value is higher than before, but still implying significance mainly for students over population under \(20\) (increase), number of schools (increase) and number of classes (decrease).

#For the SD specification:
impSDM <- impacts(SDM, listw=dnb16.list, R=100)
summary(impSDM, zstats=TRUE, short=TRUE)
## Impact measures (mixed, exact):
##                        Direct     Indirect       Total
## Stud_Pop_under_20  6.26298708 -0.354101931  5.90888515
## Scuole             0.87621431  0.311613916  1.18782822
## Classi            -0.39260974 -0.218805756 -0.61141549
## Media_stud_scuola  0.03991738  0.013822906  0.05374028
## Studenti           0.01466678  0.009759782  0.02442656
## ========================================================
## Simulation results ( variance matrix):
## ========================================================
## Simulated standard errors
##                        Direct    Indirect       Total
## Stud_Pop_under_20 0.373419564 1.309149836 1.260999246
## Scuole            0.157184587 0.347006745 0.369748942
## Classi            0.052103876 0.180123295 0.181889842
## Media_stud_scuola 0.004242590 0.008948923 0.008183575
## Studenti          0.002623351 0.008959167 0.008932699
## 
## Simulated z-values:
##                      Direct   Indirect     Total
## Stud_Pop_under_20 16.808074 -0.2589755  4.708509
## Scuole             5.585806  1.0887202  3.396347
## Classi            -7.601943 -1.3778486 -3.542107
## Media_stud_scuola  9.367406  1.5743938  6.577956
## Studenti           5.649340  1.2146089  2.877303
## 
## Simulated p-values:
##                   Direct     Indirect Total     
## Stud_Pop_under_20 < 2.22e-16 0.79565  2.4954e-06
## Scuole            2.3262e-08 0.27628  0.00068292
## Classi            2.9088e-14 0.16825  0.00039694
## Media_stud_scuola < 2.22e-16 0.11540  4.7696e-11
## Studenti          1.6107e-08 0.22452  0.00401090

Focusing on the Lagrange multiplier diagnostics for spatial dependence, let’s try focusing on the same model but reducing the number of predictors to those with higher impacts on both SEM and SDM models.

According to the summary, all statistics are significant, except for Robust Linear Model test for error dependence (RLMerr).

OLSmodel <- lm(Media_stud_classe ~ Stud_Pop_under_20 + Scuole + Classi, data = tn, na.action = na.ignore)
natOLSlmTests <- lm.LMtests(OLSmodel, dnb16.list, 
                    test=c("LMerr", "LMlag", "RLMerr", "RLMlag"))
summary(natOLSlmTests)
##  Lagrange multiplier diagnostics for spatial dependence
## data:  
## model: lm(formula = Media_stud_classe ~ Stud_Pop_under_20 + Scuole +
## Classi, data = tn, na.action = na.ignore)
## weights: dnb16.list
##  
##        statistic parameter   p.value    
## LMerr   13.68278         1 0.0002164 ***
## LMlag   24.24549         1 8.481e-07 ***
## RLMerr   0.28245         1 0.5950979    
## RLMlag  10.84515         1 0.0009905 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

According to Elhorst (2010)15, a valid comparison between two models can be made through an inferential strategy. The best specification under the ML estimation, we should:

  1. Estimate the OLS model and test (with the LM test) whether the SAR or the SEM is more appropriate to describe the data;

  2. If the OLS model is rejected in favour of SAR, the SEM or in favour of both models, then the SDM should be estimated;

  3. Likelihood ratio (LR) tests can subsequently be used to test whether:

    • the SDM can be simplified to SAR,
    • whether SDM can be simplified to the SEM.

If both hypotheses are rejected, then the SDM best describes the data. If hypothesis i) cannot be rejected, then the SAR best describes the data.

If hypothesis ii) cannot be rejected, then the SEM best describes the data, provided that the (robust) LM tests also pointed to the SEM. We can perform LR tests of restrictions on the parameters of spatial models using the function anova().

With the likelihood ratio test, we can see if the SAR can be simplified by the SDM, which is false due to the high p-value. On the other hand, performing the same test on SDM and SEM makes us come to the conclusion that SDM cannot be simplified by SEM, because of the p-value above \(0.05\) threshold.

# Test hypothesis i)
anova(SAR, SDM)
##     Model df    AIC  logLik Test L.Ratio p-value
## SAR     1  8 764.71 -374.36    1                
## SDM     2 13 766.64 -370.32    2  8.0703 0.15241
# Test hypothesis ii)
anova(SDM, SEM)
##     Model df    AIC  logLik Test L.Ratio p-value
## SDM     1 13 766.64 -370.32    1                
## SEM     2  8 766.05 -375.03    2  9.4111 0.09375

Since both hypothesis are rejected, SDM best describes data.

In conclusion, a positive significant impact on the mean students per class has been found through the number of students over the population under 20 and the number of schools, implying that higher levels of students per class in the territory lead to the increase in the number of schools and students ratio over young population in the same neighbourhood. The same applies to the number of classes, with a negative impact since the more classes there are, the more spread students will be among them.


  1. Tobler, Waldo R. 1970. “A Computer Movie Simulating Urban Growth in the Detroit Region.” Economic Geography 46 (2): 234–40. http://www.geog.ucsb.edu/~tobler/publications/pdf_docs/A-Computer-Movie.pdf↩︎

  2. Wikipedia page of Centroid, https://en.wikipedia.org/wiki/Centroid↩︎

  3. How to find the centre of a polygon in Python, https://deparkes.co.uk/2015/02/28/how-to-find-the-centre-of-a-polygon-in-python/↩︎

  4. saveGIF() documentation https://www.rdocumentation.org/packages/animation/versions/2.4.1/topics/saveGIF↩︎

  5. Spatial regression models documentation, https://rspatial.org/raster/analysis/7-spregression.html↩︎

  6. Spatial Autocorrelation, https://ibis.geog.ubc.ca/courses/geob479/notes/spatial_analysis/spatial_autocorrelation.htm#↩︎

  7. Anselin Luc. 1995. “Local Indicators of Spatial Association—LISA.” Geographical Analysis 27 (2): 93–115.↩︎

  8. Moran, P. A. P. 1950. “Notes on Continuous Stochastic Phenomena.”, Biometrika 37 (1/2): 17–23↩︎

  9. LeSage, J., and R.K. Pace. 2009. Introduction to Spatial Econometrics. Statistics: A Series of Textbooks and Monographs. CRC Press↩︎

  10. LeSage, J. 2014. “What Regional Scientists Need to Know About Spatial Econometrics.” Review of Regional Studies 44 (1): 13–32↩︎

  11. lagsarlm documentation, https://cran.r-project.org/web/packages/spatialreg/spatialreg.pdf↩︎

  12. spdep Documentation,https://www.rdocumentation.org/packages/spdep/versions/1.0-2/topics/impacts↩︎

  13. LeSage J and RK Pace (2009) Introduction to Spatial Econometrics. CRC Press, Boca Raton, pp. 33–42, 114–115; LeSage J and MM Fischer (2008) Spatial growth regressions: model specification, estimation and interpretation. Spatial Economic Analysis 3 (3), pp. 275–304↩︎

  14. Roger Bivand, Gianfranco Piras (2015). Comparing Implementations of Estimation Methods for Spatial Econometrics. Journal of Statistical Software, 63(18), 1-36. https://www.jstatsoft.org/v63/i18/↩︎

  15. Elhorst, J. P. 2010. “Applied Spatial Econometrics: Raising the Bar.” Spatial Economic Analysis 5 (1). Routledge: 9–28.↩︎